{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train Ad-Hoc Model in this Notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q xgboost==0.90"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "import pandas as pd\n",
    "\n",
    "sess   = sagemaker.Session()\n",
    "bucket = sess.default_bucket()\n",
    "role = sagemaker.get_execution_role()\n",
    "region = boto3.Session().region_name\n",
    "\n",
    "sm = boto3.Session().client(service_name='sagemaker', region_name=region)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "import pandas as pd\n",
    "import os\n",
    "\n",
    "def load_dataset(path, sep, header):\n",
    "    if (os.path.isdir(path)):\n",
    "        data = pd.concat([pd.read_csv(f, sep=sep, header=header) for f in glob.glob('{}/*.csv'.format(path))], ignore_index = True)\n",
    "    else:\n",
    "        data = pd.concat([pd.read_csv(f, sep=sep, header=header) for f in glob.glob('{}'.format(path))], ignore_index = True)        \n",
    "\n",
    "    labels = data.iloc[:,0]\n",
    "    features = data.drop(data.columns[0], axis=1)\n",
    "    \n",
    "    if header==None:\n",
    "        # Adjust the column names after dropping the 0th column above\n",
    "        # New column names are 0 (inclusive) to len(features.columns) (exclusive)\n",
    "        new_column_names = list(range(0, len(features.columns)))\n",
    "        features.columns = new_column_names\n",
    "\n",
    "    return features, labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, y_train = load_dataset(path='./amazon_reviews_us_Digital_Software_v1_00_autopilot_header.csv', sep=',', header=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = X_train[:500]\n",
    "y_train = y_train[:500]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "X_train.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_train.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Transform the Raw Text into TF/IDF Embeddings\n",
    "\n",
    "Before we train our XGBoost model, we need to convert our raw review text features into numeric features.  \n",
    "\n",
    "To do this, we use a technique called `Term Frequency Inverse Document Frequency` or `TF/IDF`.  Here is the Wikipedia definition:  https://en.wikipedia.org/wiki/Tf%E2%80%93idf\n",
    "\n",
    "TF/IDF takes into account the frequency of a term or word in a document weighted by the overall frequency across the entire corpus.\n",
    "This way, common words and terms are naturally weighted lower because of the Inverse Document Frequency portion of TF/IDF.  The end result is a vector-based representation of each review - commonly called an embedding!  We will explore other types of embeddings later.\n",
    "\n",
    "TF/IDF will creates these vectors - also called embeddings - with 1,000's of values for each text-based review... 1 value for each term or word.  Therefore, we should reduce our feature-space down to the top-K dimensions that describe our dataset using a technique called `Singular Value Decomposition` or `SVD`.  Here is the Wikipedia definition:  https://en.wikipedia.org/wiki/Singular_value_decomposition.\n",
    "\n",
    "In Natural Language Processing (NLP) context, the combination of TF/IDF and SVD is typically called `Latent Semantic Analysis` or `LSA`:  https://en.wikipedia.org/wiki/Latent_semantic_analysis.  Every modern NLP library supports TF/IDF, SVD, and LSA.\n",
    "\n",
    "Note that we have to manually chose a set of hyper-parameters for TF/IDF including `max_df`, `min_df`, `max_features`, etc.  We also need to specify hyper-parameters for `TruncatedSVD` including `n_components` for the number of top-K dimensions that describe our dataset.  AutoPilot will automatically chose these hyper-parameters for us.  But, for now, we must chose them ourselves based on intuition, experience, and sometimes luck!\n",
    "\n",
    "Lastly, we scale our features to squash outliers by centering and scaling our data around a normal Gaussian distribution with mean 0 and unit variance.  If we don't perform this scaling, features with large variance may dominate and prevent our model from fitting to our dataset.\n",
    "\n",
    "Altogether, we create a `pipeline` of feature transformations.  Note that this entire pipeline must be applied to our features during both model training *and* model predicting.  This highlights one of the key difficulties of `ad-hoc` model training like we are doing here.  How do I re-use this pipeline outside of this notebook?  We will cover this soon."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define the Transform Function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import TruncatedSVD\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "def feature_transform_fn(df_text, column_name, num_components):\n",
    "    text_processors = Pipeline(\n",
    "        steps=[\n",
    "            (\n",
    "                'tfidfvectorizer',\n",
    "                TfidfVectorizer(\n",
    "                    max_df=0.25,                                       \n",
    "                    min_df=.0025,\n",
    "                    analyzer='word',\n",
    "                    max_features=10000\n",
    "                )\n",
    "            )\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    column_transformer = ColumnTransformer(\n",
    "        transformers=[('text_processing', text_processors, df_text.columns.get_loc(column_name))]\n",
    "    )\n",
    "\n",
    "    pipeline = Pipeline(\n",
    "        steps=[\n",
    "            ('column_transformer', column_transformer), \n",
    "            ('dimension_reducer', TruncatedSVD(n_components=num_components)),\n",
    "            ('standard_scaler', StandardScaler())\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    return pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run the Transform\n",
    "_This will take a minute or two.  Please be patient._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = feature_transform_fn(df_text=X_train, column_name='review_body', num_components=300)\n",
    "np_tfidf = pipeline.fit_transform(X_train)\n",
    "df_tfidf = pd.DataFrame(np_tfidf)\n",
    "df_tfidf.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show the learned TF/IDF features for each sentence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vectorizer_tfidf = pipeline \\\n",
    "    .named_steps['column_transformer'] \\\n",
    "    .transformers[0][1].named_steps['tfidfvectorizer']\n",
    "\n",
    "vectorizer_tfidf_fitted = vectorizer_tfidf.fit_transform(X_train['review_body'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_vectorizer_tfidf = pd.DataFrame(vectorizer_tfidf_fitted.toarray())\n",
    "df_vectorizer_tfidf.columns = vectorizer_tfidf.get_feature_names()\n",
    "df_vectorizer_tfidf.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_with_tfidf = pd.merge(X_train['review_body'], \n",
    "                                      df_vectorizer_tfidf,\n",
    "                                      left_index=True,\n",
    "                                      right_index=True)\n",
    "df_with_tfidf.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train the Model using XGBoost and our New TF/IDF Features\n",
    "\n",
    "We train using the TF/IDF features from above - as well as our labels.\n",
    "\n",
    "_This will take a few minutes.  Please be patient._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import xgboost as xgb\n",
    "from xgboost import XGBClassifier\n",
    "\n",
    "objective  = 'binary:logistic'\n",
    "max_depth  = 5\n",
    "num_round  = 1\n",
    "\n",
    "xgb_estimator = XGBClassifier(objective=objective,\n",
    "                              num_round=num_round,\n",
    "                              max_depth=max_depth)\n",
    "\n",
    "xgb_estimator.fit(df_tfidf, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Calculate Test Metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TODO:  Load test dataset (must from original TF/IDfdataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!ls -al ./data/test.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#X_test, y_test = load_dataset(path='./data/test.csv', sep=',', header=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#X_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#X_test.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test our Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert Raw Test Inputs into TF/IDF Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np_tfidf_test = feature_transform_fn(df_text=X_test, column_name='review_body', num_components=300).fit_transform(X_test)\n",
    "df_tfidf_test = pd.DataFrame(np_tfidf_test)\n",
    "df_tfidf_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_test = xgb_estimator.predict(df_tfidf_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score, precision_score, classification_report, confusion_matrix\n",
    "\n",
    "print('Test Accuracy: ', accuracy_score(y_test, preds_test))\n",
    "print('Test Precision: ', precision_score(y_test, preds_test, average=None))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(classification_report(y_test, preds_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import seaborn as sn\n",
    "# import pandas as pd\n",
    "# import matplotlib.pyplot as plt\n",
    "\n",
    "# df_cm_test = confusion_matrix(y_test, preds_test)\n",
    "# df_cm_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import itertools\n",
    "# import numpy as np\n",
    "\n",
    "# import matplotlib.pyplot as plt\n",
    "# %matplotlib inline\n",
    "# %config InlineBackend.figure_format='retina'\n",
    "\n",
    "# def plot_conf_mat(cm, classes, title, cmap = plt.cm.Greens):\n",
    "#     print(cm)\n",
    "#     plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
    "#     plt.title(title)\n",
    "#     plt.colorbar()\n",
    "#     tick_marks = np.arange(len(classes))\n",
    "#     plt.xticks(tick_marks, classes, rotation=45)\n",
    "#     plt.yticks(tick_marks, classes)\n",
    "\n",
    "#     fmt = 'd'\n",
    "#     thresh = cm.max() / 2.\n",
    "#     for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
    "#         plt.text(j, i, format(cm[i, j], fmt),\n",
    "#         horizontalalignment=\"center\",\n",
    "#         color=\"black\" if cm[i, j] > thresh else \"black\")\n",
    "\n",
    "#         plt.tight_layout()\n",
    "#         plt.ylabel('True label')\n",
    "#         plt.xlabel('Predicted label')\n",
    "\n",
    "# # Plot non-normalized confusion matrix\n",
    "# plt.figure()\n",
    "# fig, ax = plt.subplots(figsize=(6,4))\n",
    "# plot_conf_mat(df_cm_test, classes=['Not Positive Sentiment', 'Positive Sentiment'], \n",
    "#                           title='Confusion matrix')\n",
    "# plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from sklearn import metrics\n",
    "\n",
    "# import matplotlib.pyplot as plt\n",
    "# %matplotlib inline\n",
    "# %config InlineBackend.figure_format='retina'\n",
    "\n",
    "# auc = round(metrics.roc_auc_score(y_test, preds_test), 4)\n",
    "# print('AUC is ' + repr(auc))\n",
    "\n",
    "# fpr, tpr, _ = metrics.roc_curve(y_test, preds_test)\n",
    "\n",
    "# plt.title('ROC Curve')\n",
    "# plt.plot(fpr, tpr, 'b',\n",
    "# label='AUC = %0.2f'% auc)\n",
    "# plt.legend(loc='lower right')\n",
    "# plt.plot([0,1],[0,1],'r--')\n",
    "# plt.xlim([-0.1,1.1])\n",
    "# plt.ylim([-0.1,1.1])\n",
    "# plt.ylabel('True Positive Rate')\n",
    "# plt.xlabel('False Positive Rate')\n",
    "# plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Save Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import os\n",
    "\n",
    "# import pickle as pkl\n",
    "\n",
    "# # See https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html\n",
    "# # Need to save with joblib or pickle.  `xgb.save_model()` does not save feature_names\n",
    "# model_dir  = './models/notebook/'\n",
    "\n",
    "# os.makedirs(model_dir, exist_ok=True)\n",
    "# model_path = os.path.join(model_dir, 'xgboost-model')\n",
    "\n",
    "# pkl.dump(xgb_estimator, open(model_path, 'wb'))\n",
    "\n",
    "# print('Wrote model to {}'.format(model_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Restore Model \n",
    "This simulates restoring a model within an application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# import pickle as pkl\n",
    "# import os\n",
    "\n",
    "# model_dir  = './models/notebook/'\n",
    "# model_path = os.path.join(model_dir, 'xgboost-model')\n",
    "\n",
    "# xgb_estimator_restored = pkl.load(open(model_path, 'rb'))\n",
    "\n",
    "# type(xgb_estimator_restored)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Confirm the Predictions are OK from the Restored Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# preds_test = xgb_estimator_restored.predict(df_tfidf_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from sklearn.metrics import accuracy_score, precision_score, classification_report, confusion_matrix\n",
    "\n",
    "# print('Test Accuracy: ', accuracy_score(y_test, preds_test))\n",
    "# print('Test Precision: ', precision_score(y_test, preds_test, average=None))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
